Similarity Based Deduplication with Small Data Chunks

Authors

  • Lior Aronovich
  • Ron Asher
  • Danny Harnik
  • Michael Hirsch
  • Shmuel Tomi Klein
  • Yair Toaff
Abstract

Large backup and restore systems may hold a petabyte or more of data in their repository. Such repositories are often compressed by means of deduplication techniques that partition the input data into chunks and store recurring chunks only once. One approach is to use hashing methods to store a fingerprint for each data chunk, detecting identical chunks with a very low probability of collisions. As an alternative, it has been suggested to use similarity- rather than identity-based searches, which allows the definition of much larger chunks. This implies that the data structure needed to store the fingerprints is much smaller, so that such a system may be more scalable than systems built on the first approach. This paper deals with an extension of the second approach to systems in which it is still preferable to use small chunks. We describe the design choices made during the development of what we call an approximate hash function, which serves as the basic tool of the suggested deduplication system, and report on extensive tests performed on a variety of large input files.
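
The abstract contrasts two families of fingerprinting. The following minimal Python sketch illustrates the distinction; the helper names (identity_fingerprint, similarity_signature) and all parameter values are illustrative assumptions, not the scheme developed in the paper.

    import hashlib
    import zlib
    from typing import List

    def identity_fingerprint(chunk: bytes) -> str:
        # Identity-based approach: one cryptographic hash per (small) chunk;
        # two chunks are treated as duplicates only if the hashes match exactly.
        return hashlib.sha256(chunk).hexdigest()

    def similarity_signature(block: bytes, window: int = 8, keep: int = 4) -> List[int]:
        # Similarity-based approach (illustrative): hash every window-sized slice
        # of a large block and keep only the few largest values. Blocks sharing
        # most of their content tend to share most of these representative values,
        # so the signature can locate a stored block that is merely similar.
        hashes = {zlib.crc32(block[i:i + window]) for i in range(len(block) - window + 1)}
        return sorted(hashes, reverse=True)[:keep]

    # A small local edit breaks the identity fingerprint, but typically leaves
    # much of the similarity signature intact.
    a = b"the quick brown fox jumps over the lazy dog " * 200
    b = a.replace(b"lazy", b"sleepy", 1)
    print(identity_fingerprint(a) == identity_fingerprint(b))           # False
    print(set(similarity_signature(a)) & set(similarity_signature(b)))  # usually non-empty

Because only a handful of signature values are kept per (large) block, the in-memory index in the similarity-based scheme can stay far smaller than a full per-chunk fingerprint index.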


Related articles

Cryptographic Hashing Method using for Secure and Similarity Detection in Distributed Cloud Data

Received Jun 29, 2017; revised Nov 23, 2017; accepted Dec 17, 2017. The explosive increase of data brings new challenges to data storage and supervision in cloud settings. These data typically have to be processed in an appropriate fashion in the cloud, so any added latency may cause an immense loss to the enterprises. Duplicate detection plays a major role in data management. Data...
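
As a rough illustration of fingerprint-based duplicate detection in a chunk store, here is a minimal sketch; ChunkStore and its methods are hypothetical names, not the API of the paper cited above.

    import hashlib

    class ChunkStore:
        """Toy duplicate detector: a chunk is stored only if its
        cryptographic fingerprint has not been seen before."""

        def __init__(self):
            self._index = {}  # hex digest -> stored chunk (in practice, its location)

        def put(self, chunk: bytes) -> str:
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self._index:   # duplicate detection by fingerprint
                self._index[digest] = chunk
            return digest                   # callers keep this reference, not the bytes

        def get(self, digest: str) -> bytes:
            return self._index[digest]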


Controlling the Chunk-Size in Deduplication Systems

A special case of data compression in which repeated chunks of data are stored only once is known as deduplication. The input data is cut into chunks and a cryptographically strong hash value of each (different) chunk is stored. To restrict the influence of small inserts and deletes to local perturbations, the chunk boundaries are usually defined in a data dependent way, which implies that the ...
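
The data-dependent chunk boundaries mentioned above are usually found with a rolling hash. The sketch below uses a Gear-style hash with illustrative mask and size limits; it is not the specific scheme analysed in that paper.

    import random

    random.seed(0)
    GEAR = [random.getrandbits(64) for _ in range(256)]  # fixed random value per byte

    def cdc_chunks(data: bytes, mask=(1 << 13) - 1, min_size=2048, max_size=65536):
        # A boundary is declared wherever the low bits of the rolling hash are all
        # zero, so boundaries depend only on nearby bytes and a small insert or
        # delete perturbs the chunking only locally.
        chunks, start, h = [], 0, 0
        for i, byte in enumerate(data):
            h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF  # old bytes shift out
            size = i - start + 1
            if (size >= min_size and (h & mask) == 0) or size >= max_size:
                chunks.append(data[start:i + 1])
                start, h = i + 1, 0
        if start < len(data):
            chunks.append(data[start:])
        return chunks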


A Cost-efficient Rewriting Scheme to Improve Restore Performance in Deduplication Systems

In chunk-based deduplication systems, logically consecutive chunks are physically scattered in different containers after deduplication, which results in the serious fragmentation problem. The fragmentation significantly reduces the restore performance due to reading the scattered chunks from different containers. Existing work aims to rewrite the fragmented duplicate chunks into new containers...
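
A very small sketch of the rewriting idea described above, assuming a simple chunk-to-container map and a purely illustrative sparseness threshold:

    from collections import Counter

    def chunks_to_rewrite(recipe, container_of, threshold=3):
        # recipe: ordered list of chunk ids making up the backup;
        # container_of: maps a chunk id to the container holding it.
        # If restoring would read only a few chunks from some old container,
        # it is cheaper to copy (rewrite) those chunks into the container
        # currently being written than to reference the old one.
        reads = Counter(container_of[c] for c in recipe)  # chunks needed per container
        return [c for c in recipe if reads[container_of[c]] <= threshold]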


Target Deduplication Metrics and Risk Analysis Using Post Processing Methods

In modern intelligent storage technologies, data deduplication is a compression technique used to discard duplicate copies of repeating data. It improves storage utilization and is applied to huge network data transfers to reduce the number of bytes that must be transferred. In this process, similar data chunks (patterned byte sequences) are classified and stored at this stage. As a ...
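
Target-side (post-processing) deduplication runs only after the data has already landed on the target storage. A minimal sketch of such an offline pass, with purely illustrative data structures:

    import hashlib

    def post_process_dedup(blocks):
        # Fingerprint each already-written block and replace every repeat
        # with a reference to its first occurrence.
        first_seen, layout = {}, []
        for i, block in enumerate(blocks):
            fp = hashlib.sha256(block).hexdigest()
            if fp in first_seen:
                layout.append(("ref", first_seen[fp]))  # duplicate: keep a pointer only
            else:
                first_seen[fp] = i
                layout.append(("data", block))          # first copy: keep the bytes
        return layout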


Robust Inline Data Reduction Technique in Multi-tenant Storage

Data deduplication has gained increasing popularity as a space-reduction approach in backup storage systems. One of the main challenges for centralized data deduplication is the scalability of the fingerprint-index search. Existing deduplication work mainly focuses on backup systems. In this paper, we propose a system that effectively exploits similarity and locality of data blocks to achieve h...
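
One common way to exploit similarity and locality so that the fingerprint index stays small is to keep only one representative fingerprint per segment of consecutive chunks in memory. The sketch below assumes hypothetical names (SparseIndex, "hook") and is not the design of the paper cited above.

    import hashlib

    def chunk_fp(chunk: bytes) -> bytes:
        return hashlib.sha1(chunk).digest()

    class SparseIndex:
        def __init__(self):
            self.hooks = {}  # hook fingerprint -> full fingerprint set of that segment

        def add_segment(self, chunks):
            # Store one segment: its minimum fingerprint acts as the "hook".
            fps = [chunk_fp(c) for c in chunks]
            self.hooks[min(fps)] = set(fps)

        def dedup(self, chunks):
            # Look up the incoming segment's hook; on a hit, deduplicate the
            # incoming chunks against that single, locality-preserving segment.
            fps = [chunk_fp(c) for c in chunks]
            known = self.hooks.get(min(fps), set())
            return [c for c, f in zip(chunks, fps) if f not in known]  # chunks left to store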



Journal title:

Volume   Issue

Pages  -

Publication date: 2012